Evaluation of machine learning algorithms for the prognosis of breast cancer from the Surveillance, Epidemiology, and End Results database

Introduction Many researchers have used machine learning (ML) to predict the prognosis of breast cancer (BC) patients and reported that ML models had good individualized prediction performance. Objective This cohort study aimed to establish a reliable data analysis model by comparing the performance of 10 common ML algorithms and the traditional American Joint Committee on Cancer (AJCC) stage, and to use this model in web application development to provide individualized predictions for other users. Methods This study included 63145 BC patients from the Surveillance, Epidemiology, and End Results database. Results Comparing the performance of the 10 ML algorithms and the 7th AJCC stage in the optimal test set, we found that, in terms of 5-year overall survival, multivariate adaptive regression splines (MARS) had the highest area under the curve (AUC) value (0.831) and F1-score (0.608), and both sensitivity (0.737) and specificity (0.772) were relatively high. MARS also showed a higher AUC value (0.831, 95% confidence interval: 0.820–0.842) than the other ML algorithms and the 7th AJCC stage (all P < 0.05). MARS, the best performing model, was selected for web application development (https://w12251393.shinyapps.io/app2/). Conclusions This comparative study of multiple forecasting models on a large data set found that the MARS-based model performed significantly better than the other ML algorithms and the 7th AJCC stage in individualized estimation of survival of BC patients. ML models of this kind are very likely to be the next step towards precision medicine.


Introduction
Breast cancer (BC) is the leading cancer in women, and BC alone accounted for 30% of newly diagnosed cancers in American women in 2019 [1]. Assessing the prognosis of BC patients can significantly affect the choice of treatment plan. For example, patients with a poor prognosis may choose a more aggressive treatment. The most important prediction tool, and the one that remains in worldwide use today, is the American Joint Committee on Cancer (AJCC) staging system [2]. However, there is evidence that the traditional AJCC staging system cannot accurately assess the prognosis of BC patients [3][4][5]. Many complex factors affect the prognosis of cancer patients, so survival prediction for cancer patients is a challenging task. In this context, modern oncology has witnessed growing interest in digital technology, and the integration of digital technology and large medical data has brought new hope for personalized medicine.
Machine learning (ML) is a branch of artificial intelligence that employs a variety of statistical, probabilistic and optimization techniques that allow computers to "learn" from past examples and to detect hard-to-discern patterns in large, noisy or complex data sets [6]. Many articles have used ML to predict the prognosis of cancer patients, including those with BC, lung cancer, and liver cancer, and reported that ML models had good individualized prediction performance. For example, Kalafi et al [12] reported that a multilayer perceptron produced desirable accuracy in predicting the prognosis of BC patients. Tahmassebi et al [13] proposed that extreme gradient boosting with multiparametric magnetic resonance imaging achieved stable performance for the early prediction of pathological complete response to neoadjuvant chemotherapy and of survival outcomes in BC patients. Poirion et al [14] introduced a novel ensemble framework of deep-learning and machine-learning approaches that robustly predicted BC patient survival subtypes using multi-omics data. A retrospective study on predicting 10-year survival after breast cancer surgery revealed that all performance indices for the deep neural network model were significantly higher than those of the other forecasting models [16]. Liu et al [27] proposed a gradient boosting algorithm that optimized the survival analysis of the XGBoost framework for ties in order to predict the disease progression of breast cancer. ML, therefore, is very likely to be the next step towards precision medicine.
Because ML models are susceptible to factors such as data sources, input variables, and software, studies using ML to predict the prognosis of BC patients have reached conflicting conclusions [7-9, 12-16, 20, 25-28]. Lotfnezhad Afshar et al [9] found that the support vector machine (SVM) model outperformed other models in predicting the survival rate of BC patients. In contrast, Delen et al [30] indicated that the decision tree (DT) was the best predictor. Furthermore, a retrospective study reported that the random forest (RF) model showed better diagnostic performance for predicting recurrence than five other machine learning classifiers [25]. Additionally, although some researchers claimed that these ML techniques could effectively predict the prognosis of patients, few of the techniques were actually used in clinical practice. This study aimed to establish a reliable data analysis model by comparing the performance of 10 common ML algorithms and the traditional AJCC staging system on a national database, and to use this model in web application development to provide individualized predictions for other users.

Database and samples
The Surveillance, Epidemiology, and End Results (SEER) Program of the National Cancer Institute is an authoritative source of information on cancer incidence and survival in the United States and covers approximately 48.0% of the United States population [31]. Although the SEER database has some limitations, such as the lack of certain data (for example, postoperative complications, surgical margin, and recurrence), its multi-center and large-sample characteristics make it suitable for building an ML model for the general population.
The data for this study were acquired from the SEER database. A total of 154014 patients met the initial criteria: year of diagnosis from 2010 to 2014; primary tumor site coded as C50.0 to C50.6 (C50.0-Nipple, C50.1-Central portion of breast, C50.2-Upper-inner quadrant of breast, C50.3-Lower-inner quadrant of breast, C50.4-Upper-outer quadrant of breast, C50.5-Lower-outer quadrant of breast, C50.6-Axillary tail of breast); behavior recode for analysis of malignant; and diagnostic confirmation by positive histology. Because the endpoint of this study was the 5-year overall survival (OS) rate, we excluded patients with missing data and patients who were alive with a survival time of less than 60 months, whose 5-year status could not be determined. This left 63145 patients for analysis (Fig 1).
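The exclusion rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Patient` record and its field names are hypothetical and are not SEER variable names, and the labelling of deaths that occurred after 60 months is an assumption, since the text does not spell it out.

```python
from dataclasses import dataclass

@dataclass
class Patient:
    # Hypothetical record; field names are illustrative, not SEER variables
    survival_months: int   # follow-up time in months
    alive: bool            # vital status at last follow-up
    has_missing: bool      # True if any study covariate is missing

def eligible(p: Patient) -> bool:
    """Keep only patients whose 5-year status can be determined."""
    if p.has_missing:
        return False
    # Alive with less than 60 months of follow-up: 5-year status unknown
    return not (p.alive and p.survival_months < 60)

def died_within_5_years(p: Patient) -> bool:
    """Binary endpoint (assumed labelling): death before 60 months."""
    return (not p.alive) and p.survival_months < 60

cohort = [
    Patient(70, True, False),   # alive beyond 5 years: kept
    Patient(30, True, False),   # alive, short follow-up: excluded
    Patient(30, False, False),  # died within 5 years: kept
    Patient(70, False, False),  # died after 5 years: kept
    Patient(10, False, True),   # missing covariates: excluded
]
kept = [p for p in cohort if eligible(p)]
labels = [died_within_5_years(p) for p in kept]
```

Excluding censored patients in this way turns the survival problem into a binary classification problem, which is what allows ordinary classifiers to be compared on a single 5-year endpoint.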

Statistical analysis
Categorical variables were presented as frequency and percentage, and continuous variables were presented as mean (x̄) and standard deviation (s). Only the 7th AJCC staging information could be obtained from the SEER database for this study. For the ML models, we extracted from the SEER database 15 factors that may affect the prognosis of patients, chosen on the basis of professional knowledge: age at diagnosis, gender, race, marital status at diagnosis, tumor site, origin of primary, grade, tumor size, N status, M status, breast subtype, surgery, regional lymph node dissection, chemotherapy, and radiotherapy (see the results for details). We used the Boruta package [32] in the R software for feature selection and found that all 14 attributes other than origin of primary were confirmed as important (Fig 2). Boruta finds relevant features by comparing the importance of the original attributes with the importance achievable at random, estimated using permuted copies of the attributes (shadows). We therefore included these 14 covariates in the 10 ML models.
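The shadow-feature idea can be illustrated with a small Python sketch. It is not the Boruta package's implementation: the absolute Pearson correlation is swapped in for the random-forest importance that Boruta uses, the majority-vote decision rule stands in for Boruta's statistical test, and the function names and toy data are hypothetical.

```python
import random

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def shadow_selection(features, outcome, iterations=20, seed=0):
    """Return the features that beat the best shadow in most iterations."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(iterations):
        # Shadows: permuted copies whose link with the outcome is broken
        shadow_best = 0.0
        for values in features.values():
            shadow = list(values)
            rng.shuffle(shadow)
            shadow_best = max(shadow_best, abs(pearson(shadow, outcome)))
        # A feature scores a hit when it is more important than every shadow
        for name, values in features.items():
            if abs(pearson(values, outcome)) > shadow_best:
                hits[name] += 1
    # Assumed decision rule: confirmed if it wins in more than half the runs
    return {name for name, h in hits.items() if h > iterations / 2}

# Toy data: x1 tracks the outcome exactly, x2 is unrelated to it
outcome = [0, 1] * 50
features = {"x1": [0, 1] * 50, "x2": [0, 0, 1, 1] * 25}
confirmed = shadow_selection(features, outcome)
```

The design point is that the threshold for "important" is not fixed in advance but is set by what permuted, uninformative copies of the same data can achieve.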
To reduce over-fitting and ensure the robustness of the models, we used 9-fold cross-validation with the caret package [33] and selected the test set with the centered area under the curve (AUC) value as the optimal test set. We used the DeLong test to compare the AUC values of the different ML algorithms and the 7th AJCC stage in the optimal test set, and the best performing model was selected for web application development with the shiny package [34] and the shinydashboard package [35]. We used accuracy, F1-score, sensitivity, specificity, and AUC to evaluate the performance of the models for each prediction case.
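As a minimal sketch of the evaluation measures named above, the following Python code computes accuracy, sensitivity, specificity, F1-score, and a rank-based AUC from labels and predictions. It assumes that death is coded as the positive class (1) and that "centered AUC" means the median AUC across the 9 folds; both are assumptions, and the function names and fold values are hypothetical.

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and F1 for binary labels (1 = positive).
    Assumes both classes occur and at least one positive is predicted."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }

def auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties = 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def median_fold_index(fold_aucs):
    """Index of the fold whose AUC is the median (assumed meaning of 'centered')."""
    order = sorted(range(len(fold_aucs)), key=lambda i: fold_aucs[i])
    return order[len(order) // 2]

y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
metrics = confusion_metrics(y_true, y_pred)
area = auc(y_true, scores)
# Hypothetical AUC values for 9 folds
chosen = median_fold_index([0.81, 0.83, 0.80, 0.82, 0.84, 0.79, 0.85, 0.78, 0.86])
```

Choosing the fold with the median AUC, rather than the best one, avoids reporting the most favourable split.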
Ten common ML algorithms were selected in this study: naive Bayes (NB), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), k-nearest neighbor (KNN), SVM, classification and regression trees (CART), RF, multivariate adaptive regression splines (MARS), logistic regression (LR), and extreme gradient boosting (XGBoost). For each of these algorithms, we used the default parameters of the relevant package; details are given below.
NB computes the conditional a posteriori probabilities of a categorical class variable given independent predictor variables using Bayes' rule. Although it assumes that the presence or absence of a characteristic describing a certain class is unrelated to the presence or absence of any other characteristic, which is not true for the majority of classification tasks, NB has been successful in complex practical applications [36]. The analysis of NB in this study was realized by the e1071 package in R [37].
Discriminant analysis, which includes LDA and QDA, derives classification rules from known samples in each class in order to determine the class of a new sample. The difference between the two is that LDA assumes that the variables are multivariate normally distributed in each group with different mean vectors and identical covariance matrices, whereas QDA does not require the assumption of equal covariance matrices. This is the basic reason that LDA is a much less flexible classifier than QDA [38]. The discriminant analysis in this study was realized by the MASS package in R [39].
KNN, a nonparametric supervised learning algorithm, is used for data classification and regression [40]. It predicts the outcome of a test sample from the k training samples that lie closest to it in the training set. The analysis of KNN in this study was realized by the kknn package in R [41].
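A minimal sketch of the nearest-neighbor rule described above, in Python. It uses an unweighted majority vote with Euclidean distance, which is a simplification and not necessarily what the kknn package does by default; the function name and the toy data are hypothetical.

```python
from collections import Counter
import math

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Pair each training sample's distance to the query with its label
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy data: two well-separated groups
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]
near_origin = knn_predict(train_x, train_y, (0.5, 0.5))
near_far = knn_predict(train_x, train_y, (5.5, 5.5))
```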
The basic idea of SVM is to find the separating hyperplane that divides the data set correctly with the largest geometric margin, and to use this hyperplane to classify the data. The analysis of SVM in this study was realized by the kernlab package in R [42].
The CART model, a machine-learning and data-mining recursive algorithm, was used to identify groups of patients with a homogeneous risk of death and investigate the hierarchical association between variables and survival [43]. No pruning was done on the model. The analysis of CART in this study was realized by the rpart package in R [44].
RF is an ensemble learning method based on decision trees. In this study, the minimum number of trees grown, obtained with the randomForest package [45] in the training set, was 441, and the model was verified in the test set.
MARS was a non-parametric modelling method that extends the linear model, incorporating nonlinearities and interactions between variables. It was a flexible tool that automated the construction of predictive models [46]. The analysis of MARS in this study was realized by the earth package in R [47].
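How MARS extends the linear model can be sketched with hinge functions, the piecewise-linear basis functions it uses. The Python code below is only an illustration: the knot, signs, and coefficients are hypothetical and are not taken from the fitted model in this study, and the earth package's forward and backward passes that choose them are not shown.

```python
def hinge(x, knot, sign=1):
    """Hinge basis function: max(0, x - knot) for sign=1, max(0, knot - x) for sign=-1."""
    return max(0.0, sign * (x - knot))

def mars_predict(x, intercept, terms):
    """Evaluate an additive MARS-style model in one variable.
    terms: list of (coefficient, knot, sign) tuples."""
    return intercept + sum(c * hinge(x, k, s) for c, k, s in terms)

# Hypothetical model: the slope changes at a knot of 50
terms = [(2.0, 50.0, 1), (0.5, 50.0, -1)]
above = mars_predict(60.0, 1.0, terms)   # 1 + 2.0 * 10 + 0.5 * 0
below = mars_predict(40.0, 1.0, terms)   # 1 + 2.0 * 0 + 0.5 * 10
at_knot = mars_predict(50.0, 1.0, terms)
```

Each pair of hinges lets the fitted response have a different slope on either side of a knot, which is how nonlinearity enters; products of hinge functions in different variables would represent interactions.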
LR is one of the most important models among generalized linear models (GLM). It is mainly used to study the relationship between a binary response variable ("success" and "failure" are represented by 1 and 0, respectively) and multiple covariates, and to establish corresponding models and make predictions.
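The link between covariates and a binary response in LR can be sketched as follows. The coefficients here are hypothetical; in practice they would be estimated from the training data, and the function name is illustrative only.

```python
import math

def logistic_probability(covariates, coefficients, intercept):
    """P(response = 1) under a logistic model: sigmoid of the linear predictor."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, covariates))
    return 1.0 / (1.0 + math.exp(-linear))

# With a zero linear predictor the two outcomes are equally likely
p_even = logistic_probability([0.0, 0.0], [1.0, 2.0], 0.0)
# A coefficient of log(3) multiplies the odds by 3 per unit of the covariate
p_odds3 = logistic_probability([1.0], [math.log(3)], 0.0)
```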
The algorithm of XGBoost was a gradient-boosting decision tree that can be used for both classification and regression problems [48]. The greedy method optimized the maximal gain of the objective function during the construction of each tree layer [49]. The analysis of XGBoost in this study was realized by the xgboost package [50].
To further analyze the best performing model, we evaluated the variable importance in this model. According to the Results section, MARS was the best performing model. We used three criteria to estimate the variable importance of the model through the evimp function that comes with the earth package [51]: (i) The nsubsets criterion counted the number of model subsets that included the variable. Variables that were included in more subsets were considered more important. (ii) The residual sum-of-squares (RSS) criterion first calculated the decrease in the RSS for each subset relative to the previous subset during earth's backward pass. Then, for each variable, it summed these decreases over all subsets that included the variable. Finally, for ease of interpretation, the summed decreases were scaled so that the largest summed decrease was 100. Variables that caused larger net decreases in the RSS were considered more important. (iii) The generalized cross validation (GCV) criterion was the same, but used the GCV instead of the RSS. Under this criterion, adding a variable could increase the GCV, that is, adding the variable could have a deleterious effect on the model as measured by its estimated predictive power on unseen data. All statistical analyses were conducted using R software (version 4.1.0).
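The nsubsets and RSS criteria described above can be sketched in Python. This follows the verbal description only and is not the earth package's code: the subsets, variable names, and RSS values are hypothetical, and the subsets are assumed to be ordered by size starting from the intercept-only model.

```python
def rss_importance(subsets):
    """subsets: list of (variables, rss) ordered from smallest to largest model.
    Returns (nsubsets counts, RSS decreases scaled so that the largest is 100)."""
    nsubsets = {}
    summed = {}
    for (_, prev_rss), (variables, rss) in zip(subsets, subsets[1:]):
        decrease = prev_rss - rss  # improvement relative to the previous subset
        for v in variables:
            nsubsets[v] = nsubsets.get(v, 0) + 1
            summed[v] = summed.get(v, 0.0) + decrease
    top = max(summed.values())
    scaled = {v: 100.0 * d / top for v, d in summed.items()}
    return nsubsets, scaled

# Hypothetical backward-pass subsets with their RSS values
subsets = [
    (set(), 100.0),                    # intercept only
    ({"age"}, 60.0),
    ({"age", "size"}, 45.0),
    ({"age", "size", "M"}, 40.0),
]
counts, scaled = rss_importance(subsets)
```

Replacing the RSS values with GCV values in the same calculation gives the GCV criterion; because the GCV can increase when a variable is added, a summed decrease under that criterion can be negative.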
An ethics statement was not required for this study because this observational study used deidentified and publicly available data from SEER. This study was conducted in accordance with the Declaration of Helsinki. In addition, the Data-Use Agreements for the 1975-2017 SEER Research Data File and the SEER Radiation Therapy and Chemotherapy Information were signed, so the database could be accessed.

Baseline characteristics
Descriptive characteristics of the 63145 BC patients are summarized in Table 1. The average age of the patients was 62.6 ± 13.8 years, and 81.1% of the patients were white. As of the end of follow-up (November 2019), a total of 15734 patients had died, and the 5-year OS was 75.1%.

Machine learning algorithms and 7th AJCC stage
Comparing the performance of the 10 ML algorithms and the 7th AJCC stage in the test set (Tables 2 and 3, Fig 3) in terms of 5-year OS, we found that LDA had the highest accuracy (0.771), with relatively high specificity (0.806) and AUC value (0.813) but lower sensitivity (0.665). MARS had the highest AUC value (0.831) and F1-score (0.608), and both its sensitivity (0.737) and specificity (0.772) were relatively high. MARS also showed a higher AUC value (0.831, 95% confidence interval: 0.820-0.842) than the other ML algorithms and the 7th AJCC stage (all P < 0.05, Table 3). MARS therefore had the best forecasting ability among these models. The algorithm with the highest sensitivity was RF (0.763). KNN showed the highest specificity (0.807) and the lowest sensitivity (0.596).
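The reported F1-score for MARS can be checked for consistency against its sensitivity and specificity. The Python sketch below assumes that death is the positive class and that the proportion of deaths in the test set equals the cohort-wide proportion (15734 of 63145); neither assumption is stated in the text, and the function name is hypothetical.

```python
def f1_from_rates(sensitivity, specificity, prevalence):
    """F1-score implied by sensitivity, specificity and positive-class prevalence."""
    tp = sensitivity * prevalence               # true-positive fraction
    fp = (1 - specificity) * (1 - prevalence)   # false-positive fraction
    precision = tp / (tp + fp)
    return 2 * precision * sensitivity / (precision + sensitivity)

# Assumed: deaths are the positive class, prevalence taken from the whole cohort
prevalence = 15734 / 63145
f1_mars = f1_from_rates(0.737, 0.772, prevalence)
```

Under these assumptions the implied value is close to the reported 0.608, which suggests that the reported metrics are mutually consistent when death is treated as the positive class.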

Evaluating variable importance in the MARS model
Evaluating variable importance in the MARS model showed that age at diagnosis was the most important variable, followed by tumor size, M status, regional lymph node dissection, N status, breast subtype, and so on (Table 4).

Web application development
On the basis of its AUC value, we selected the MARS model for the development of a web application that is free for other users (https://w12251393.shinyapps.io/app2/). This web application automatically calculates the 5-year OS according to the patient characteristics selected by the user.

Discussion
ML can be defined as a process of designing a model and improving its performance through empirical learning. It is a field of artificial intelligence and an active area of research in different scientific fields. Complex ML models can pick up on subtler patterns in input data and thus can be more effective predictors [52]. ML, therefore, is very likely to be the next step towards precision medicine.
In our research, ROC curve analysis showed that the AUC value of the 7th AJCC stage was 0.683 (95%CI: 0.669-0.698, Fig 3A), which was within the range of 0.620 to 0.728 reported in previous studies [53][54][55]. Using the DeLong test on more than 60000 BC patients, we found that each of the 10 ML algorithms predicted the 5-year OS better than the 7th AJCC stage (all P < 0.001, Table 3). The 7th AJCC stage also showed the lowest accuracy (0.612) and F1-score (0.473). To the best of our knowledge, no previous study has compared the predictive ability of the AJCC stage and ML models for BC patients. We could only obtain the staging information of the 7th AJCC edition because of the need for a 5-year follow-up period and the limitations of the SEER database, while some researchers believed that the latest 8th AJCC stage still could not accurately stratify the prognosis of BC patients [4,56]. For example, a study comparing the prognostic accuracy of the 8th AJCC prognostic staging system with that of the 7th staging system using data from over 168000 BC patients confirmed the enhanced value of the 8th edition, but found that it still needed further improvement [56]. Furthermore, although several studies found that the AUC value of the AJCC stage had risen from the 0.620-0.728 range of the 7th edition to the 0.670-0.773 range of the 8th edition [53][54][55], this was still some distance from the AUC value of the MARS model in this study (AUC: 0.831, 95%CI: 0.820-0.842). Our results indicated that MARS had the best performance among the 10 ML algorithms and the 7th AJCC stage in predicting the 5-year OS of BC patients (Tables 2 and 3). MARS was a nonparametric modelling method that extended the linear model, incorporating nonlinearities and interactions between variables. It was a flexible tool that automated the construction of predictive models [46].
To the best of our knowledge, no previous study has used the MARS model to predict the prognosis of BC patients. Studies using different ML models to predict the prognosis of BC patients have reached conflicting conclusions [7-9, 14-16, 20, 25-28, 57]. For example, Kate et al [7] found that NB was better than DT and LR in a study of more than 160000 BC patients. Moreover, a meta-analysis of 11 articles about ML algorithms for BC risk calculation confirmed that the SVM algorithm was able to calculate breast cancer risk with better accuracy than other ML algorithms [57], but that article did not include the MARS algorithm. Because ML models are susceptible to factors such as data sources, input variables, and software, and because the number of ML algorithms compared differs between studies, it is difficult to compare our results directly with those of other studies.
To our surprise, this study found that age at diagnosis was the most important variable, even ahead of distant metastasis. Several studies have proposed that age is an independent prognostic factor for BC [58][59][60], but it has been well documented that metastasis is the main cause of death for patients with breast cancer [61]. Estimating predictor importance is in general a tricky and even controversial problem. The evimp function is useful in practice for MARS models, but several issues can make it misleading [51]. For example, collinear (or otherwise related) variables can mask each other's importance, just as in linear models. This means that if two predictors are closely related, the earth model building algorithm will somewhat arbitrarily choose one over the other [51], and the chosen predictor will incorrectly appear more important [51]. Estimates of predictor importance can therefore be unreliable because they can vary with different training data.
Nonetheless, this research had some advantages. Firstly, the data of this study came from the SEER database, which is one of the most representative large tumor databases in North America. Moreover, we compared the accuracy, F1-score, sensitivity, specificity, and AUC values of 10 ML algorithms in detail and reported the P values for the comparisons of AUC values, while some other studies used fewer than 5 ML algorithms and rarely reported P values [7][8][9][14]. More significantly, we used the selected ML model in web application development to provide individualized predictions for other users online.
This study also had several limitations. Firstly, the SEER database lacked some data that affect the prognosis of patients, such as postoperative complications, surgical margin, and recurrence. Secondly, the models in this study were all trained and tested on different parts of the same data set. Ideally, a model would be trained on one data set and validated on another, separately collected data set. Such external validation could demonstrate the generalizability of the model. We did not have access to another data set for external validation, so we had to divide the data set into a training set and a test set. Although this research used 9-fold cross-validation to reduce over-fitting and ensure the robustness of the models, whether these ML models generalize well to new data sets requires further research. Thirdly, compared with traditional statistical models, ML algorithms have black box characteristics, so interpretation and understanding of the ML model remains a key issue. Fourthly, because the sample size of this study exceeded 60000 and the computation time required for deep learning was too long, this study did not test deep models, which may affect the results of this study. Future work could aim for a more accurate predictive model by including more ML algorithms, such as deep learning models.

Conclusions
This comparative study of multiple forecasting models on a large data set found that the MARS-based model performed significantly better than the other ML algorithms and the 7th AJCC stage in individualized estimation of survival of BC patients. ML models of this kind are very likely to be the next step towards precision medicine.